aws: support multipart copy for objects larger than 5GB #561
base: main
Conversation
I think this probably warrants a higher-level ticket to discuss how we should support this. As a start, it would be good to understand how other stores, e.g. GCS and Azure, handle this, so that we can develop an abstraction that makes sense. In particular, I wonder if this functionality would make more sense as part of the multipart upload functionality? That of course depends on what other stores support. In general, filing an issue first to get consensus on an approach is a good idea before jumping in on an implementation.
Great, created #563.
@tustvold updated to pass CI and made a small tweak to avoid overflow.
Thanks @james-rms and @tustvold -- the high-level idea seems reasonable to me, but I think this code needs tests (maybe unit tests?) or something, otherwise we may inadvertently break the functionality in some future refactor.
@tustvold I've refactored slightly for unit tests and added a couple of integration tests as well. Please take another look when you have some time.
src/aws/mod.rs (outdated)

```rust
    mode,
    extensions: _,
} = options;
// Determine source size to decide between single CopyObject and multipart copy
```
Is there some way we can avoid this, e.g. try CopyObject normally and fall back to multipart on error? Otherwise this adds an additional S3 round trip to every copy request.
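For illustration, a minimal sketch of that fallback shape, written generically since the crate's internal method names aren't shown here; the two futures and the `is_source_too_large` predicate are hypothetical stand-ins for the real CopyObject and multipart copy paths:

```rust
use std::future::Future;

/// Sketch only: the futures and the error predicate are hypothetical
/// stand-ins for the crate's actual CopyObject and multipart copy code.
async fn copy_with_fallback<T, E>(
    copy_object: impl Future<Output = Result<T, E>>,
    copy_multipart: impl Future<Output = Result<T, E>>,
    is_source_too_large: impl Fn(&E) -> bool,
) -> Result<T, E> {
    match copy_object.await {
        Ok(v) => Ok(v),
        // Only pay for the multipart path when the plain CopyObject request is
        // rejected because the source exceeds the 5GB limit; the common
        // small-object case still costs a single round trip.
        Err(e) if is_source_too_large(&e) => copy_multipart.await,
        Err(e) => Err(e),
    }
}
```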
```rust
    .unwrap();

    let mut payload = BytesMut::zeroed(10 * 1024 * 1024);
    rand::fill(&mut payload[..]);
```
👍
tustvold left a comment
Looks good and is well tested. Sorry for the delay in reviewing; I've been absolutely swamped.
I think we need to find a way to avoid regressing the common case of files smaller than 5GB, e.g. by first attempting CopyObject and then falling back if it errors (I am presuming S3 gives a sensible error here).
This is what we get back from S3: As explained by AWS: So the real question is: if we make a CopyObject call, can we assume that any InvalidRequest that comes back is because the object was >5GB in size? Given the documentation I think that's OK, but I'm not really clear that it will remain OK going forward. I'll push a commit that does this and let you decide on the approach.
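As a sketch of that assumption, the classification the fallback would rest on is roughly the following; treating any InvalidRequest as "source too large" is exactly the coarse part, since other malformed requests can surface under the same code:

```rust
/// Sketch only: assumes the caller has extracted the S3 error code string
/// from the failed CopyObject response. Other invalid requests may also
/// report `InvalidRequest`, which is why this mapping might not stay
/// reliable going forward.
fn is_probably_source_too_large(s3_error_code: &str) -> bool {
    s3_error_code == "InvalidRequest"
}
```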
I guess another option would be to make this disabled by default and therefore opt-in... How do other clients handle this?
The Go S3 SDK leaves this up to the caller to figure out: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/s3#Client.CopyObject. The Go Cloud SDK does not seem to support copies >5GB, because copy calls are forwarded directly to the CopyObject API and it doesn't expose a multipart copy method: https://github.com/google/go-cloud/blob/a52bb6614a70209265758ad7a795a4a3931fbe0b/blob/s3blob/s3blob.go#L856
Co-authored-by: Raphael Taylor-Davies <[email protected]>
@tustvold I've implemented this as disabled by default (>5GB copies will just fail) and opt-in. By default there is no pre-copy HEAD request.
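As an illustration of the opt-in behaviour described above, the dispatch would look roughly like the sketch below; the flag name and types are hypothetical, not the crate's actual configuration keys or methods:

```rust
/// Sketch only: `multipart_copy_enabled` stands in for the real opt-in setting.
#[derive(Debug, PartialEq)]
enum CopyPlan {
    /// Default: a single CopyObject request and no extra round trips;
    /// sources larger than 5GB will simply fail.
    SingleRequest,
    /// Opt-in: determine the source size first (a pre-copy HEAD request),
    /// then copy it in parts if it exceeds the single-request limit.
    HeadThenMaybeMultipart,
}

fn plan_copy(multipart_copy_enabled: bool) -> CopyPlan {
    if multipart_copy_enabled {
        CopyPlan::HeadThenMaybeMultipart
    } else {
        CopyPlan::SingleRequest
    }
}
```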
Which issue does this PR close?
Closes #563.
Rationale for this change
Today, users that attempt to copy a >5GB object in S3 using `object_store` will see an error, because the CopyObject API rejects source objects larger than 5GB. The way to get around this problem, per AWS's docs, is to do the copy in several parts using multipart copies. This PR adds that functionality to the AWS client.
It adds two additional configuration parameters:
The defaults are chosen to minimise surprise: since people are used to copies not requiring several requests, we don't switch to the multipart method until it's absolutely necessary, and when we do, we use as few parts as possible.
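A sketch of the "as few parts as possible" arithmetic, assuming the 5GB single-copy limit; the constant and function names here are illustrative rather than the crate's actual identifiers:

```rust
/// 5 GiB: the documented limit for a single CopyObject source (and for each
/// UploadPartCopy part).
const MAX_COPY_PART_SIZE: u64 = 5 * 1024 * 1024 * 1024;

/// Split a source of `source_size` bytes into the minimum number of byte
/// ranges that each fit under the single-copy limit. A caller would only
/// reach this when `source_size` exceeds `MAX_COPY_PART_SIZE`.
fn copy_part_ranges(source_size: u64) -> Vec<std::ops::Range<u64>> {
    // div_ceil keeps the part count minimal while avoiding the
    // overflow-prone `(size + part - 1) / part` pattern.
    let parts = source_size.div_ceil(MAX_COPY_PART_SIZE).max(1);
    let part_size = source_size.div_ceil(parts);
    (0..parts)
        .map(|i| {
            let start = i * part_size;
            let end = (start + part_size).min(source_size);
            start..end
        })
        .collect()
}
```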
What changes are included in this PR?
See above.
Are there any user-facing changes?
Yes - these configuration parameters should be covered by the docstring changes.